Advanced Optimization Algorithms


These algorithms can make logistic regression run much more
quickly than is possible with gradient descent.


They also scale much better to very large machine learning
problems, for example when there is a very large number of features.



Optimization algorithm

Cost function $J(\theta)$. Want $\min_\theta J(\theta)$.

Given $\theta$, we have code that can compute

- $J(\theta)$

- $\frac{\partial}{\partial \theta_j} J(\theta)$   (for $j = 0, 1, \ldots, n$)

Gradient descent:

Repeat {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
}
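
As a rough illustration of gradient descent, the sketch below applies the update to a small quadratic cost. The cost function, the learning rate `alpha = 0.1`, and the iteration count are all assumptions chosen for the example; they do not come from the slides.

```octave
% Hypothetical cost, for illustration only:
%   J(theta) = (theta(1) - 5)^2 + (theta(2) - 5)^2
J     = @(theta) (theta(1) - 5)^2 + (theta(2) - 5)^2;
gradJ = @(theta) [2 * (theta(1) - 5); 2 * (theta(2) - 5)];

alpha = 0.1;          % learning rate, picked by hand (assumption)
theta = zeros(2, 1);  % initial parameters

for iter = 1:200
  % Simultaneous update of every component of theta
  theta = theta - alpha * gradJ(theta);
end

fprintf('theta = [%f; %f], J = %g\n', theta(1), theta(2), J(theta));
```

Here the learning rate has to be chosen by hand: a value that is too large makes the iteration diverge, and one that is too small makes it slow.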


Optimization algorithms: 

- Gradient descent 

- Conjugate gradient 


- BFGS 

- L-BFGS 


Advantages: 

- No need to manually pick $\alpha$

- Often faster than gradient descent.

Disadvantages: 

- More complex 



% 'GradObj' 'on': costFunction also returns the gradient.
% 'MaxIter' 100: stop after at most 100 iterations.
options = optimset('GradObj', 'on', 'MaxIter', 100);

initialTheta = zeros(2, 1);

[optTheta, functionVal, exitFlag] ...
    = fminunc(@costFunction, initialTheta, options);


fminunc attempts to find a minimum of a scalar function of several variables,
starting at an initial estimate.

[optTheta, functionVal, exitFlag] ...
    = fminunc(@costFunction, initialTheta, options);

The function arguments: 

1. The objective function to be minimized (the cost function) 

2. The initial values of the function variables (parameters $\theta$)

3. The optimization options specified in options 

The function returns:

1. The optimal values of the function variables (parameters $\theta$), at which the
function attains a local minimum.

2. The value of the objective function at the optimal values of the parameters. 

3. A value exitFlag that describes the exit condition. 


theta = $[\theta_0, \theta_1, \ldots, \theta_n]^T$

function [jVal, gradient] = costFunction(theta)

    jVal = [code to compute $J(\theta)$];

    gradient(1) = [code to compute $\frac{\partial}{\partial \theta_0} J(\theta)$];

    gradient(2) = [code to compute $\frac{\partial}{\partial \theta_1} J(\theta)$];

    ...

    gradient(n+1) = [code to compute $\frac{\partial}{\partial \theta_n} J(\theta)$];
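
To make the template concrete, here is a minimal sketch that fills it in for a hypothetical two-parameter quadratic cost and passes it to `fminunc`. The cost is an assumption chosen only because its minimum, at `[5; 5]`, is easy to check by hand. The sketch is written as an Octave script file, which is why it starts with `1;`.

```octave
1;  % marks this file as a script so functions can be defined before use (Octave)

% Hypothetical cost, for illustration only:
%   J(theta) = (theta(1) - 5)^2 + (theta(2) - 5)^2
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);  % partial derivative w.r.t. theta(1)
  gradient(2) = 2 * (theta(2) - 5);  % partial derivative w.r.t. theta(2)
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);

[optTheta, functionVal, exitFlag] = ...
    fminunc(@costFunction, initialTheta, options);

fprintf('optTheta = [%f; %f], functionVal = %g, exitFlag = %d\n', ...
        optTheta(1), optTheta(2), functionVal, exitFlag);
```

Octave indexes vectors from 1, so `theta(1)` holds the first parameter, which is why the template runs from `gradient(1)` to `gradient(n+1)`.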



Machine Learning 


Regularization 


The problem of overfitting



Example: Linear regression (housing prices)



[Figure: three fits to the housing data, labeled Underfitting, Just right, and Overfitting]


Overfitting: If we have too many features, the learned hypothesis
may fit the training set very well ($J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 \approx 0$), but fail to
generalize to new examples (predict prices on new examples).
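
The effect on the training cost can be seen in a small sketch. The five data points below are made up for illustration, and `polyfit`/`polyval` stand in for whatever fitting procedure is actually used. A 4th-order polynomial has five coefficients, so it can pass through all five points and drive the training cost to roughly zero, while a straight line cannot.

```octave
% Hypothetical training set: house sizes and prices in arbitrary units
x = [1; 2; 3; 4; 5];
y = [1.0; 2.7; 3.4; 3.9; 4.1];
m = numel(x);

% Training cost J = 1/(2m) * sum((h(x) - y).^2) for polynomial coefficients p
cost = @(p) sum((polyval(p, x) - y) .^ 2) / (2 * m);

pLine    = polyfit(x, y, 1);  % straight line
pQuartic = polyfit(x, y, 4);  % 4th-order polynomial, interpolates all 5 points

fprintf('J (line)    = %g\n', cost(pLine));
fprintf('J (quartic) = %g\n', cost(pQuartic));
```

A training cost near zero says nothing about how the quartic behaves between or beyond the training points, which is where overfitting shows up.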


Example: Logistic regression 



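$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
($g$ = sigmoid function)

$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2)$

$g(\theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^2 x_2 + \theta_4 x_1^2 x_2^2 + \theta_5 x_1^2 x_2^3 + \theta_6 x_1^3 x_2 + \ldots)$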



Addressing overfitting:

$x_1$ = size of house
$x_2$ = no. of bedrooms
$x_3$ = no. of floors
$x_4$ = age of house
$x_5$ = average income in neighborhood
$x_6$ = kitchen size
...
$x_{100}$


Options: 

1. Reduce number of features.

    - Manually select which features to keep.

    - Model selection algorithm.

    - Disadvantage: discarding features also discards the information they carry.

2. Regularization.

    - Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$.

    - Works well when we have a lot of features, each of
      which contributes a bit to predicting $y$.
Regularization


Cost function


Intuition

[Figure: two plots of Price against Size of house, one fitted with $\theta_0 + \theta_1 x + \theta_2 x^2$ and one with $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$]

Suppose we penalize $\theta_3$ and $\theta_4$ and make them really small.



Small values for parameters $\theta_0, \theta_1, \ldots, \theta_n$
- "Simpler" hypothesis
- Less prone to overfitting


- Features: $x_1, x_2, \ldots, x_{100}$

- Parameters: $\theta_0, \theta_1, \theta_2, \ldots, \theta_{100}$
Regularization

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

The extra regularization term at the end, $\lambda \sum_{j=1}^{n} \theta_j^2$, penalizes the
parameters $\theta_j$ and so shrinks every one of $\theta_1, \ldots, \theta_n$.

$\lambda$ is the regularization parameter. It controls a trade-off between
two different goals:

• The goal of fitting the training data well.

• The goal of keeping the parameters small, and therefore
keeping the hypothesis relatively simple, to avoid overfitting.



In regularized linear regression, we choose $\theta$ to minimize

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$
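
A minimal sketch of the regularized linear regression cost is given below. The function name `regularizedCost`, the argument layout, and the tiny data set are assumptions for illustration. `X` is taken to be the design matrix with a leading column of ones, so `theta(1)` plays the role of $\theta_0$ and is left out of the penalty.

```octave
1;  % marks this file as a script so functions can be defined before use (Octave)

% X: m x (n+1) design matrix whose first column is all ones (x_0 = 1)
% y: m x 1 targets, theta: (n+1) x 1 parameters, lambda: regularization parameter
function J = regularizedCost(theta, X, y, lambda)
  m = numel(y);
  residual = X * theta - y;                     % h_theta(x^(i)) - y^(i) for each i
  penalty  = lambda * sum(theta(2:end) .^ 2);   % skips theta(1), i.e. theta_0
  J = (sum(residual .^ 2) + penalty) / (2 * m);
end

% Hypothetical data that theta = [0; 1] fits exactly
X = [1 1; 1 2; 1 3];
y = [1; 2; 3];
theta = [0; 1];

J0 = regularizedCost(theta, X, y, 0);  % no penalty: 0
J6 = regularizedCost(theta, X, y, 6);  % penalty only: 6 * 1^2 / (2 * 3) = 1
fprintf('J (lambda = 0) = %g, J (lambda = 6) = %g\n', J0, J6);
```

With a perfect fit the first term vanishes, so the whole cost comes from the penalty, which makes the role of $\lambda$ easy to see.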


What if $\lambda$ is set to an extremely large value, so that $\theta_1, \ldots, \theta_n$ are
penalized heavily? Suppose it is far too large for our problem, say $\lambda = 10^{10}$.
This means $\theta_1 \approx 0$, $\theta_2 \approx 0$, and so on,
so $h_\theta(x) \approx \theta_0$.
The result is underfitting.


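[Figure: housing data plotted against Size of house, with hypothesis $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$]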



Regularization 

Regularized linear regression

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

$$\min_\theta J(\theta)$$

Gradient descent:

Repeat {

    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$

    $\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$    $(j = 1, 2, 3, \ldots, n)$

}
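
The update can be sketched in vectorized form as follows. The function name `regularizedGradientDescent`, the data, the learning rate, and the iteration count are assumptions for illustration. The first parameter (Octave's `theta(1)`, i.e. $\theta_0$) gets no regularization term.

```octave
1;  % marks this file as a script so functions can be defined before use (Octave)

% Gradient descent for regularized linear regression.
% X: m x (n+1) design matrix with a leading column of ones; y: m x 1 targets.
function theta = regularizedGradientDescent(X, y, alpha, lambda, numIters)
  m = numel(y);
  theta = zeros(size(X, 2), 1);
  for iter = 1:numIters
    grad = (X' * (X * theta - y)) / m;                        % unregularized gradient
    grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);  % add (lambda/m) * theta_j for j >= 1
    theta = theta - alpha * grad;                             % simultaneous update
  end
end

% Hypothetical data: one feature plus the intercept column
X = [ones(5, 1), (1:5)'];
y = [1.0; 2.7; 3.4; 3.9; 4.1];

thetaPlain = regularizedGradientDescent(X, y, 0.05, 0, 5000);   % lambda = 0
thetaReg   = regularizedGradientDescent(X, y, 0.05, 10, 5000);  % lambda = 10

fprintf('slope without regularization: %f\n', thetaPlain(2));
fprintf('slope with lambda = 10:       %f\n', thetaReg(2));
```

On this data the regularized run ends with a smaller slope than the unregularized one, which is the shrinking effect the penalty is meant to produce.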
Regularization


Regularized logistic regression



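$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^2 x_2 + \theta_4 x_1^2 x_2^2 + \theta_5 x_1^2 x_2^3 + \ldots)$

Gradient descent:

Repeat {

    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$

    $\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]$    $(j = 1, 2, 3, \ldots, n)$

}

where $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$.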
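Advanced optimization


function [jVal, gradient] = costFunction(theta)

    jVal = [code to compute $J(\theta)$];

        $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$

    gradient(1) = [code to compute $\frac{\partial}{\partial \theta_0} J(\theta)$];

        $\frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$

    gradient(2) = [code to compute $\frac{\partial}{\partial \theta_1} J(\theta)$];

        $\frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_1^{(i)} + \frac{\lambda}{m} \theta_1$

    gradient(3) = [code to compute $\frac{\partial}{\partial \theta_2} J(\theta)$];

        $\frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_2^{(i)} + \frac{\lambda}{m} \theta_2$

    ...

    gradient(n+1) = [code to compute $\frac{\partial}{\partial \theta_n} J(\theta)$];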
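
One way the template might be filled in for regularized logistic regression is sketched below. The extra arguments `X`, `y`, and `lambda`, the helper `sigmoid`, and the six-example data set are assumptions for illustration. Because `fminunc` expects a function of `theta` alone, the data is bound with an anonymous function.

```octave
1;  % marks this file as a script so functions can be defined before use (Octave)

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

% Regularized logistic regression cost and gradient.
% X: m x (n+1) design matrix with a leading column of ones; y: m x 1 labels in {0, 1}.
function [jVal, gradient] = costFunction(theta, X, y, lambda)
  m = numel(y);
  h = sigmoid(X * theta);   % h_theta(x^(i)) for every example
  reg = theta;
  reg(1) = 0;               % theta_0 (Octave's theta(1)) is not regularized
  jVal = -(y' * log(h) + (1 - y)' * log(1 - h)) / m ...
         + (lambda / (2 * m)) * sum(reg .^ 2);
  gradient = (X' * (h - y)) / m + (lambda / m) * reg;
end

% Hypothetical data: one feature, classes overlap slightly
X = [ones(6, 1), [0.5; 1.0; 1.5; 3.0; 3.5; 4.0]];
y = [0; 0; 1; 0; 1; 1];
lambda = 1;

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);

[optTheta, functionVal, exitFlag] = ...
    fminunc(@(t) costFunction(t, X, y, lambda), initialTheta, options);

fprintf('optTheta = [%f; %f], cost = %f\n', optTheta(1), optTheta(2), functionVal);
```

At `theta = 0` every prediction is 0.5, so the starting cost is $\log 2$; the optimizer should return a lower value.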




